Overview

In this homework assignment we explored, analyzed, and modeled a data set containing information on approximately 12,000 commercially available wines. The variables were mostly related to the chemical properties of the wine being sold. The response variable was the number of sample cases of wine purchased by wine distribution companies after tasting a sample.

Initially, we examined the data for problems such as missing data, outliers, and multicollinearity. Next we took the necessary steps to clean the data and built two Poisson regression, two negative binomial regression, and two multiple linear regression models using the training dataset.

We trained the models and evaluated them based on how well they performed against the provided evaluation data. Finally, we selected a final model that provided the best balance between accuracy and simplicity to predict the number of cases of wine sold given certain properties of the wine.


1. Data Exploration

The training dataset contained 12,795 observations of 14 predictor variables, where each record represented a commercially available wine.

These variables included measures of acidity and amounts of various chemical compounds, as well as qualitative and marketing-related data such as reviewer stars and consumer responses to label design.

The prediction dataset contained 3,335 observations over the same predictor variables.

  • Target: Number of Cases Purchased
  • AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average
  • Alcohol: Alcohol Content
  • Chlorides: Chloride content of wine
  • CitricAcid: Citric Acid Content
  • Density: Density of Wine
  • FixedAcidity: Fixed Acidity of Wine
  • FreeSulfurDioxide: Sulfur Dioxide content of wine
  • LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design.
  • ResidualSugar: Residual Sugar of wine
  • Stars: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor. A high number of stars suggests high sales
  • Sulphates: Sulfate content of wine
  • TotalSulfurDioxide: Total Sulfur Dioxide of Wine
  • VolatileAcidity: Volatile Acid content of wine
  • pH: pH of wine

1.1 Summary Statistics

To explore summary statistics and distribution characteristics of our dataset, we first needed to conduct some basic transformations and cleanup:

  • The ‘training’ dataset contained a single response variable target, a numeric variable indicating the number of cases purchased.
  • The ‘prediction’ dataset contained no values for target, suggesting this data was intended for prediction rather than for validation and evaluation of model performance. For clarity we renamed this dataset ‘prediction’ and created a separate validation hold-out from the training data.
  • There was a numeric index column labeling the observations which could be excluded from the models.
  • 3,335 observations (or 21%) of the total dataset had been set aside for prediction.
  • The combined training and prediction datasets consist of 16,130 observations containing 14 predictor variables.

While exploring this data, we made the following observations:

  • Data contained only numeric values.
  • 4 variables were discrete (stars, labelappeal, acidindex, target).
  • 8 variables out of 14 contained missing values.
  • target (number of cases purchased) varied between 0 and 8.
Summary Statistics
variable complete_rate n_missing min max
acidindex 1.00 0 4.00 17.00
alcohol 0.95 838 -4.70 26.50
chlorides 0.95 776 -1.17 1.35
citricacid 1.00 0 -3.24 3.86
density 1.00 0 0.89 1.10
fixedacidity 1.00 0 -18.20 34.40
freesulfurdioxide 0.95 799 -563.00 623.00
labelappeal 1.00 0 -2.00 2.00
ph 0.97 499 0.48 6.21
residualsugar 0.95 784 -128.30 145.40
stars 0.74 4200 1.00 4.00
sulphates 0.91 1520 -3.13 4.24
totalsulfurdioxide 0.95 839 -823.00 1057.00
volatileacidity 1.00 0 -2.83 3.68
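The completeness and range columns in the table above can be reproduced with base R. Below is a minimal sketch on a toy data frame (the report's actual table was likely generated by a summary package; the toy values stand in for the wine data):

```r
# Minimal base-R sketch of the summary table's columns
# (complete_rate, n_missing, min, max).
df <- data.frame(
  alcohol = c(9.4, NA, 12.8, 11.0),
  ph      = c(3.1, 3.4, NA, 3.3)
)

summarize_col <- function(x) {
  c(complete_rate = round(mean(!is.na(x)), 2),
    n_missing     = sum(is.na(x)),
    min           = min(x, na.rm = TRUE),
    max           = max(x, na.rm = TRUE))
}

res <- t(sapply(df, summarize_col))  # one row per variable
res
```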

1.2 Distribution

One of the first characteristics that stood out was the presence of negative values for many chemical compounds, and the relative normality of their distributions. This suggested they had already been power-transformed to produce normal distributions for modeling.

Variables related to sugars, chlorides, acidity, sulfides and sulfates all seemed to fall in this category. Considering that we are analyzing very small amounts of chemical compounds, we might reasonably assume their natural distributions are highly skewed.

The variable acidindex (a proprietary weighted-average measure of total acidity) seemed to be slightly right-skewed.
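The right-skew impression can be checked numerically. Here is a minimal base-R sketch of a moment-based skewness statistic (positive values indicate right-skew), illustrated on simulated data rather than the wine variables themselves:

```r
# Moment-based sample skewness: positive for right-skewed data,
# near zero for symmetric data.
skewness <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(42)
skewness(rnorm(1000))    # near 0: symmetric
skewness(rlnorm(1000))   # clearly positive: right-skewed
```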


1.3 Boxplots

Graphing our variables’ distributions with boxplots didn’t identify any outliers we needed to deal with.

We did see a relationship between stars/labelappeal and the target variable. As labelappeal increased, the target variable also went up, suggesting a positive relationship between label design appeal and cases purchased.

We also noticed that missing/NA values for stars were associated with low values for target. Taking the hint from the assignment that “sometimes, the fact that a variable is missing is actually predictive of the target”, we won’t discard or impute over the missing values for stars.


1.4 Scatter Plots

Graphing our variables with scatterplots helped us understand relationships between our independent variables and the target variable.



1.5 Correlation Matrix

In the correlation plot below, we see that the variables stars and labelappeal are the most positively correlated with the response variable (more stars, more purchases; better label, more purchases). There is a slight negative correlation between acidindex and the target variable. In terms of multicollinearity, we don’t see high correlations among the predictors, so we may not need to address it in our models.


2. Data Preparation

2.1 Transformations, Outliers

We tried transforming the sugar-, chloride-, acidity-, sulfide- and sulfate-related variables with the natural log and other power transformations, but did not arrive at an obvious or consistent transformation approach - so we may not be able to interpret model results on the scale of the original values for these variables.


2.2 Missing Data

Next we’ll find and impute any missing data. There are 8 predictor variables that contain NAs:

Missing Data
is_na pct
stars 4200 0.26
sulphates 1520 0.09
totalsulfurdioxide 839 0.05
alcohol 838 0.05
freesulfurdioxide 799 0.05
residualsugar 784 0.05
chlorides 776 0.05
ph 499 0.03

Heeding the warning in the assignment, “sometimes, the fact that a variable is missing is actually predictive of the target”, we’ll consider each of these variables carefully. While there may be data “missing completely at random” (MCAR) that we wish to impute, this may not always be the case.


2.2.1 Missing Data - Stars

The predictor Stars suggests that out of 16,000 wine samples, about 25% have never been professionally reviewed. If we assume the existence of a review has some impact on the sales of a wine brand (whatever the reviewer’s sentiment), then imputing mean or predicted values here might distort our model. Therefore, we’ll simply preserve the NAs as the model functions automatically adjust for these observations.

However, we’ll convert Stars from a numeric to a factor to enable further analysis.


2.2.2 Missing Data - Chemical Compounds

Next we consider some of the missing chemical compounds in our wines: alcohol, sugars, chlorides, sulfites and sulfates, and measures such as pH.

First, we can safely assume that all wines in this dataset have an actual pH score greater than zero (a value near zero would represent the most acidic end of the scale, such as powerful industrial acids). We’ll want to impute more reasonable values for these.

Based on some reading into the organic wines segment, there is a growing demand in the market for specialty products such as low-sulfite, low-sugar and low-alcohol wines. However, this still represents a very small segment of the overall market, and chemically it’s not likely for these compounds to be completely absent from the final product.

Additionally, the predictors freesulfurdioxide and totalsulfurdioxide are linked - the amount of ‘free’ SO2 in wine is always a subset of the total SO2 present. We identified only 59 cases where both of these values were NA, while over 1,500 cases had missing values for one or the other.

Based on these observations, we’ll use the MICE imputation method to predict and impute the missing values for residualsugar, chlorides, freesulfurdioxide, totalsulfurdioxide, sulphates, alcohol and ph.

Target/source labels and non-chemical predictors labelappeal and stars were excluded as predictors for the imputation.
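In practice this step used the mice package; its default method for numeric data, predictive mean matching (pmm), can be sketched in base R as follows. This is a single-variable illustration of the idea, not the full chained-equations algorithm:

```r
# Sketch of predictive mean matching: fit a model on complete cases, then
# fill each NA with an observed value whose predicted mean is closest to
# the predicted mean of the missing case.
set.seed(1)
df <- data.frame(x = rnorm(50))
df$y <- 2 * df$x + rnorm(50)
df$y[sample(50, 5)] <- NA            # knock out a few values

obs  <- !is.na(df$y)
fit  <- lm(y ~ x, data = df[obs, ])
pred <- predict(fit, newdata = df)   # predicted means for every row

for (i in which(!obs)) {
  donor   <- which.min(abs(pred[obs] - pred[i]))  # nearest observed donor
  df$y[i] <- df$y[obs][donor]                     # borrow its real value
}
anyNA(df$y)   # FALSE: all values imputed
```

Because imputed values are borrowed from observed donors, pmm never produces impossible values (such as a negative pH), which matches our goal here.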


2.3 Data Sparseness - Label Appeal

labelappeal is a numeric score of consumer ratings for a wine brand’s label design. It has also been pre-transformed to produce a normal distribution for modeling; however, this is a very sparse variable, with nearly half the cases having a value of zero.

This may be a candidate for handling with zero-inflated models. We won’t change the values here, but will convert labelappeal from a numeric to a factor.
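The conversion is a one-liner in base R; a minimal sketch (the values below are illustrative):

```r
# labelappeal takes integer scores from -2 to 2; as a factor, each level
# becomes its own dummy variable in the regression models that follow.
labelappeal <- factor(c(-1, 0, 0, 2, 1, 0), levels = -2:2)
levels(labelappeal)   # "-2" "-1" "0" "1" "2"
```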


2.4 Examine Final Dataset

We now have reasonably imputed values, and nearly-normal distributions for our numeric predictors, taking special note of the frequency of zero values for labelappeal and stars.

Final Dataset - Missing and Zeros
variable n_missing n_zero
acidindex 0 0
alcohol 0 5
chlorides 0 9
citricacid 0 151
density 0 0
fixedacidity 0 47
freesulfurdioxide 0 14
labelappeal 0 7087
ph 0 0
residualsugar 0 6
stars 4200 NA
sulphates 0 32
totalsulfurdioxide 0 11
volatileacidity 0 22


2.5 Split Datasets

With transformations complete, we split back into training and prediction datasets based on our source_flag, and create a 15% validation hold-out from the training data.
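A minimal sketch of that split in base R, assuming the combined data carries the source_flag column mentioned above (the sizes here are illustrative):

```r
# Split combined data back into prediction and modeling sets, then carve a
# 15% validation hold-out from the modeling (training) portion.
set.seed(123)
df <- data.frame(source_flag = rep(c("train", "predict"), c(85, 15)),
                 target      = c(rpois(85, 3), rep(NA, 15)))

df_predict <- df[df$source_flag == "predict", ]
df_model   <- df[df$source_flag == "train", ]

idx      <- sample(nrow(df_model), round(0.15 * nrow(df_model)))
df_valid <- df_model[idx, ]
df_train <- df_model[-idx, ]

c(train = nrow(df_train), valid = nrow(df_valid), predict = nrow(df_predict))
```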


3. Build Models

3.1 Poisson Regression 1

Poisson Regression assumes that the variance and mean of our dependent variable target are roughly equal, otherwise we may be looking at over- or under-dispersion.
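A quick way to check this assumption is to compare the sample mean and variance of the response. A minimal sketch on simulated counts (in practice we would run this on target):

```r
# For Poisson-distributed counts the mean and variance should be close;
# a variance well below the mean signals under-dispersion.
set.seed(5)
counts <- rpois(1000, lambda = 3)
c(mean = mean(counts), var = var(counts))   # roughly equal here
```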

pr1 <- glm(target ~ ., family = 'poisson', data = df_train)
## 
## Call:
## glm(formula = target ~ ., family = "poisson", data = df_train)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.162e+00  2.290e-01   5.075 3.88e-07 ***
## fixedacidity        7.689e-04  9.341e-04   0.823 0.410419    
## volatileacidity    -2.510e-02  7.430e-03  -3.379 0.000728 ***
## citricacid          4.399e-04  6.736e-03   0.065 0.947927    
## residualsugar       4.273e-05  1.723e-04   0.248 0.804120    
## chlorides          -2.557e-02  1.849e-02  -1.383 0.166768    
## freesulfurdioxide   4.841e-05  3.917e-05   1.236 0.216533    
## totalsulfurdioxide  3.310e-05  2.528e-05   1.309 0.190441    
## density            -3.003e-01  2.208e-01  -1.360 0.173821    
## ph                  1.317e-03  8.682e-03   0.152 0.879418    
## sulphates          -5.790e-03  6.276e-03  -0.922 0.356283    
## alcohol             4.930e-03  1.569e-03   3.143 0.001673 ** 
## labelappeal-1       2.478e-01  4.829e-02   5.132 2.86e-07 ***
## labelappeal0        4.807e-01  4.718e-02  10.189  < 2e-16 ***
## labelappeal1        6.331e-01  4.781e-02  13.241  < 2e-16 ***
## labelappeal2        7.787e-01  5.253e-02  14.823  < 2e-16 ***
## acidindex          -4.902e-02  5.311e-03  -9.231  < 2e-16 ***
## stars2              3.138e-01  1.559e-02  20.126  < 2e-16 ***
## stars3              4.275e-01  1.710e-02  24.999  < 2e-16 ***
## stars4              5.368e-01  2.347e-02  22.871  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7273.5  on 8015  degrees of freedom
## Residual deviance: 4805.2  on 7996  degrees of freedom
##   (2881 observations deleted due to missingness)
## AIC: 28734
## 
## Number of Fisher Scoring iterations: 5
AIC 28733.84
Dispersion 0.42
Log-Lik -14346.92

We note that our model has generated ‘dummies’ from our categorical variables labelappeal and stars. Of the 19 resulting coefficient terms (excluding the intercept), ten are statistically significant at the 0.05 level.

Notably, our Dispersion Parameter is 0.42, which suggests a degree of under-dispersion in the data.
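The dispersion figure we report can be computed from the fit as the sum of squared Pearson residuals over the residual degrees of freedom. A self-contained sketch on simulated under-dispersed counts (binomial counts have variance below their mean, mimicking our data):

```r
# Pearson dispersion statistic for a Poisson GLM; values well under 1
# indicate under-dispersion, as we see with the wine data.
set.seed(7)
y <- rbinom(500, size = 8, prob = 0.5)   # mean 4, variance 2: under-dispersed
m <- glm(y ~ 1, family = poisson)
dispersion <- sum(residuals(m, type = "pearson")^2) / df.residual(m)
dispersion   # well below 1
```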

Diagnostics

By graphing our target values (green) against our predicted values (blue) we can easily see this model tends to under-predict the higher count levels, and wildly over-predict the lower count levels.


3.2 Poisson Regression 2

We’ll build a Zero-Inflated Poisson model to handle the large number of zero values in our labelappeal and stars predictors, to see if we can improve model accuracy.

pr2 <- zeroinfl(target ~ . | ., data=df_train, dist = 'poisson')
## 
## Call:
## zeroinfl(formula = target ~ . | ., data = df_train, dist = "poisson")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.17641 -0.27896  0.04181  0.35555  3.81291 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         9.959e-01  2.343e-01   4.250 2.14e-05 ***
## fixedacidity        5.690e-04  9.565e-04   0.595 0.551905    
## volatileacidity    -1.139e-02  7.570e-03  -1.504 0.132552    
## citricacid         -7.126e-04  6.845e-03  -0.104 0.917085    
## residualsugar      -5.670e-05  1.746e-04  -0.325 0.745375    
## chlorides          -1.721e-02  1.887e-02  -0.912 0.361875    
## freesulfurdioxide   1.035e-05  3.957e-05   0.262 0.793581    
## totalsulfurdioxide -7.379e-06  2.508e-05  -0.294 0.768616    
## density            -3.055e-01  2.261e-01  -1.351 0.176636    
## ph                  7.151e-03  8.840e-03   0.809 0.418593    
## sulphates           1.832e-03  6.368e-03   0.288 0.773549    
## alcohol             6.182e-03  1.590e-03   3.888 0.000101 ***
## labelappeal-1       3.000e-01  5.050e-02   5.940 2.85e-09 ***
## labelappeal0        5.688e-01  4.935e-02  11.525  < 2e-16 ***
## labelappeal1        7.471e-01  4.999e-02  14.945  < 2e-16 ***
## labelappeal2        9.034e-01  5.460e-02  16.545  < 2e-16 ***
## acidindex          -1.730e-02  5.572e-03  -3.104 0.001907 ** 
## stars2              1.313e-01  1.640e-02   8.003 1.21e-15 ***
## stars3              2.320e-01  1.783e-02  13.013  < 2e-16 ***
## stars4              3.341e-01  2.406e-02  13.888  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.010e+01  2.736e+00  -3.691 0.000224 ***
## fixedacidity       -9.712e-03  1.070e-02  -0.908 0.363916    
## volatileacidity     3.231e-01  8.611e-02   3.752 0.000176 ***
## citricacid         -6.911e-03  7.743e-02  -0.089 0.928880    
## residualsugar      -3.192e-03  2.001e-03  -1.596 0.110549    
## chlorides           1.986e-01  2.212e-01   0.898 0.369129    
## freesulfurdioxide  -1.108e-03  4.600e-04  -2.409 0.016001 *  
## totalsulfurdioxide -1.158e-03  2.937e-04  -3.941 8.12e-05 ***
## density             1.800e+00  2.604e+00   0.691 0.489368    
## ph                  1.747e-01  1.015e-01   1.721 0.085183 .  
## sulphates           2.080e-01  7.249e-02   2.869 0.004121 ** 
## alcohol             2.712e-02  1.755e-02   1.545 0.122236    
## labelappeal-1       5.634e-01  6.157e-01   0.915 0.360146    
## labelappeal0        1.264e+00  5.979e-01   2.114 0.034477 *  
## labelappeal1        2.022e+00  6.029e-01   3.354 0.000795 ***
## labelappeal2        2.721e+00  6.813e-01   3.994 6.51e-05 ***
## acidindex           5.649e-01  4.863e-02  11.616  < 2e-16 ***
## stars2             -3.703e+00  3.544e-01 -10.450  < 2e-16 ***
## stars3             -1.842e+01  3.436e+02  -0.054 0.957251    
## stars4             -1.843e+01  6.972e+02  -0.026 0.978916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 50 
## Log-likelihood: -1.387e+04 on 40 Df
AIC 27818.73
Dispersion 0.31
Log-Lik -13869.37

Diagnostics

Using a Zero-Inflated model, the Dispersion Parameter drops significantly, and we get a better overall result for counts of 3 or more. By graphing our target values (green) against our predicted values (blue) we can see much greater accuracy for most of the mid and upper counts.

Notably, we are still under-predicting counts of 1-2, and greatly over-predicting counts of zero.
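The zero-count mismatch can be quantified directly: compare the observed share of zeros with the share an intercept-only Poisson fit implies. A minimal sketch on simulated zero-inflated counts:

```r
# With excess zeros, the observed zero share far exceeds exp(-lambda),
# the zero probability implied by a plain Poisson model.
set.seed(11)
y <- ifelse(rbinom(1000, 1, 0.25) == 1, 0, rpois(1000, 4))

obs_zero  <- mean(y == 0)
lambda    <- mean(y)          # Poisson MLE for the intercept-only model
pois_zero <- exp(-lambda)     # implied P(Y = 0)

c(observed = obs_zero, poisson_implied = pois_zero)
```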


3.3 Negative Binomial Regression 1

Generally, we would use Negative Binomial Regression in cases of over-dispersion (where the variance of the dependent variable is significantly greater than the mean). That does not appear to be the case with our dataset, but we’ll apply it here and examine the results:

nb1 <- glm.nb(target ~ ., data = df_train)
## 
## Call:
## glm.nb(formula = target ~ ., data = df_train, init.theta = 137389.9036, 
##     link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.162e+00  2.290e-01   5.075 3.88e-07 ***
## fixedacidity        7.689e-04  9.341e-04   0.823 0.410422    
## volatileacidity    -2.510e-02  7.430e-03  -3.379 0.000728 ***
## citricacid          4.399e-04  6.736e-03   0.065 0.947928    
## residualsugar       4.273e-05  1.723e-04   0.248 0.804114    
## chlorides          -2.557e-02  1.849e-02  -1.383 0.166771    
## freesulfurdioxide   4.841e-05  3.917e-05   1.236 0.216537    
## totalsulfurdioxide  3.310e-05  2.528e-05   1.309 0.190443    
## density            -3.003e-01  2.208e-01  -1.360 0.173824    
## ph                  1.317e-03  8.682e-03   0.152 0.879427    
## sulphates          -5.790e-03  6.276e-03  -0.922 0.356283    
## alcohol             4.930e-03  1.569e-03   3.143 0.001674 ** 
## labelappeal-1       2.478e-01  4.829e-02   5.132 2.86e-07 ***
## labelappeal0        4.807e-01  4.718e-02  10.189  < 2e-16 ***
## labelappeal1        6.331e-01  4.781e-02  13.241  < 2e-16 ***
## labelappeal2        7.787e-01  5.253e-02  14.823  < 2e-16 ***
## acidindex          -4.902e-02  5.311e-03  -9.231  < 2e-16 ***
## stars2              3.138e-01  1.559e-02  20.126  < 2e-16 ***
## stars3              4.275e-01  1.710e-02  24.999  < 2e-16 ***
## stars4              5.368e-01  2.347e-02  22.870  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(137389.9) family taken to be 1)
## 
##     Null deviance: 7273.4  on 8015  degrees of freedom
## Residual deviance: 4805.2  on 7996  degrees of freedom
##   (2881 observations deleted due to missingness)
## AIC: 28736
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  137390 
##           Std. Err.:  199804 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -28693.98
AIC 28735.98
Dispersion 0.42
Log-Lik -14346.99

Diagnostics

As expected, the Negative Binomial Regression does not outperform the Poisson.
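The reason is visible in the fitted theta (~137,390): the negative binomial variance is mu + mu^2/theta, so a very large theta makes it numerically identical to the Poisson variance (mu). A one-line check:

```r
# NB variance vs Poisson variance at the fitted theta; the gap is
# negligible, so the NB fit effectively collapses to the Poisson.
mu    <- 3          # a typical predicted count (illustrative)
theta <- 137390     # theta estimated by glm.nb above
c(poisson_var = mu, negbin_var = mu + mu^2 / theta)
```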


3.4 Negative Binomial Regression 2

We’ll build a Zero-Inflated Negative Binomial model to handle the large number of zero values in our labelappeal and stars predictors, to see if we can improve model accuracy.

nb2 <- zeroinfl(target ~ . | ., data=df_train, dist = 'negbin')
## 
## Call:
## zeroinfl(formula = target ~ . | ., data = df_train, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.17642 -0.27896  0.04182  0.35556  3.81280 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         9.960e-01  2.343e-01   4.250 2.14e-05 ***
## fixedacidity        5.688e-04  9.565e-04   0.595 0.552038    
## volatileacidity    -1.139e-02  7.570e-03  -1.504 0.132580    
## citricacid         -7.134e-04  6.845e-03  -0.104 0.916996    
## residualsugar      -5.673e-05  1.746e-04  -0.325 0.745266    
## chlorides          -1.721e-02  1.887e-02  -0.912 0.361838    
## freesulfurdioxide   1.036e-05  3.957e-05   0.262 0.793415    
## totalsulfurdioxide -7.381e-06  2.508e-05  -0.294 0.768566    
## density            -3.056e-01  2.261e-01  -1.352 0.176524    
## ph                  7.151e-03  8.840e-03   0.809 0.418547    
## sulphates           1.833e-03  6.368e-03   0.288 0.773476    
## alcohol             6.182e-03  1.590e-03   3.888 0.000101 ***
## labelappeal-1       2.999e-01  5.049e-02   5.940 2.85e-09 ***
## labelappeal0        5.687e-01  4.935e-02  11.525  < 2e-16 ***
## labelappeal1        7.471e-01  4.999e-02  14.945  < 2e-16 ***
## labelappeal2        9.034e-01  5.460e-02  16.546  < 2e-16 ***
## acidindex          -1.730e-02  5.572e-03  -3.104 0.001908 ** 
## stars2              1.313e-01  1.640e-02   8.003 1.21e-15 ***
## stars3              2.320e-01  1.783e-02  13.013  < 2e-16 ***
## stars4              3.341e-01  2.406e-02  13.888  < 2e-16 ***
## Log(theta)          1.750e+01  3.667e+00   4.772 1.83e-06 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.010e+01  2.736e+00  -3.691 0.000224 ***
## fixedacidity       -9.713e-03  1.070e-02  -0.908 0.363850    
## volatileacidity     3.231e-01  8.611e-02   3.752 0.000175 ***
## citricacid         -6.918e-03  7.743e-02  -0.089 0.928804    
## residualsugar      -3.193e-03  2.000e-03  -1.596 0.110456    
## chlorides           1.986e-01  2.212e-01   0.898 0.369105    
## freesulfurdioxide  -1.108e-03  4.600e-04  -2.408 0.016039 *  
## totalsulfurdioxide -1.158e-03  2.937e-04  -3.941 8.10e-05 ***
## density             1.802e+00  2.604e+00   0.692 0.488998    
## ph                  1.748e-01  1.015e-01   1.722 0.085113 .  
## sulphates           2.080e-01  7.249e-02   2.870 0.004109 ** 
## alcohol             2.711e-02  1.755e-02   1.545 0.122296    
## labelappeal-1       5.607e-01  6.142e-01   0.913 0.361246    
## labelappeal0        1.261e+00  5.964e-01   2.115 0.034412 *  
## labelappeal1        2.020e+00  6.014e-01   3.359 0.000784 ***
## labelappeal2        2.718e+00  6.799e-01   3.997 6.41e-05 ***
## acidindex           5.649e-01  4.863e-02  11.616  < 2e-16 ***
## stars2             -3.703e+00  3.544e-01 -10.449  < 2e-16 ***
## stars3             -1.842e+01  3.442e+02  -0.054 0.957324    
## stars4             -1.843e+01  6.976e+02  -0.026 0.978925    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 39804271.402 
## Number of iterations in BFGS optimization: 58 
## Log-likelihood: -1.387e+04 on 41 Df
AIC 27820.73
Dispersion 0.31
Log-Lik -13869.37

Diagnostics

The Zero-Inflated Negative Binomial model shows a similar improvement to the Zero-Inflated Poisson but, as before, does not outperform it: the log-likelihoods are identical and the extra theta parameter costs two AIC points (27,820.73 vs 27,818.73).


3.5 Multiple Linear Regression 1

For our first Multiple Linear Regression, we’ll use all predictors.

lm1 <- lm(target ~ ., data=df_train)
## 
## Call:
## lm(formula = target ~ ., data = df_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1886 -0.5311  0.0995  0.7384  3.2254 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.771e+00  4.955e-01   7.610 3.05e-14 ***
## fixedacidity        2.894e-03  2.041e-03   1.418   0.1563    
## volatileacidity    -9.511e-02  1.620e-02  -5.871 4.50e-09 ***
## citricacid          2.022e-03  1.476e-02   0.137   0.8910    
## residualsugar       1.683e-04  3.765e-04   0.447   0.6549    
## chlorides          -9.961e-02  4.044e-02  -2.463   0.0138 *  
## freesulfurdioxide   1.703e-04  8.545e-05   1.993   0.0463 *  
## totalsulfurdioxide  1.234e-04  5.506e-05   2.240   0.0251 *  
## density            -1.090e+00  4.827e-01  -2.259   0.0239 *  
## ph                  8.733e-03  1.895e-02   0.461   0.6449    
## sulphates          -1.968e-02  1.372e-02  -1.435   0.1515    
## alcohol             1.875e-02  3.409e-03   5.502 3.87e-08 ***
## labelappeal-1       5.110e-01  7.801e-02   6.551 6.07e-11 ***
## labelappeal0        1.220e+00  7.630e-02  15.991  < 2e-16 ***
## labelappeal1        1.860e+00  7.892e-02  23.562  < 2e-16 ***
## labelappeal2        2.628e+00  9.843e-02  26.701  < 2e-16 ***
## acidindex          -1.714e-01  1.101e-02 -15.567  < 2e-16 ***
## stars2              9.729e-01  3.094e-02  31.449  < 2e-16 ***
## stars3              1.475e+00  3.626e-02  40.671  < 2e-16 ***
## stars4              2.072e+00  5.662e-02  36.602  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.137 on 7996 degrees of freedom
##   (2881 observations deleted due to missingness)
## Multiple R-squared:  0.4642, Adjusted R-squared:  0.463 
## F-statistic: 364.7 on 19 and 7996 DF,  p-value: < 2.2e-16
AIC 24825.60
Adj R2 0.46

Diagnostics


3.6 Multiple Linear Regression 2

For our second Multiple Linear Regression, we’ll add stepwise feature selection.

lm2_all <- lm(target ~ ., data=df_train)
lm2 <- stepAIC(lm2_all, trace=FALSE, direction='both')
## 
## Call:
## lm(formula = target ~ fixedacidity + volatileacidity + chlorides + 
##     freesulfurdioxide + totalsulfurdioxide + density + sulphates + 
##     alcohol + labelappeal + acidindex + stars, data = df_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1881 -0.5326  0.1010  0.7350  3.2349 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.803e+00  4.911e-01   7.744 1.08e-14 ***
## fixedacidity        2.897e-03  2.040e-03   1.420   0.1557    
## volatileacidity    -9.509e-02  1.619e-02  -5.874 4.41e-09 ***
## chlorides          -9.998e-02  4.043e-02  -2.473   0.0134 *  
## freesulfurdioxide   1.705e-04  8.542e-05   1.996   0.0460 *  
## totalsulfurdioxide  1.236e-04  5.503e-05   2.247   0.0247 *  
## density            -1.091e+00  4.824e-01  -2.261   0.0238 *  
## sulphates          -1.966e-02  1.372e-02  -1.433   0.1518    
## alcohol             1.871e-02  3.407e-03   5.492 4.10e-08 ***
## labelappeal-1       5.115e-01  7.799e-02   6.558 5.78e-11 ***
## labelappeal0        1.221e+00  7.629e-02  16.000  < 2e-16 ***
## labelappeal1        1.860e+00  7.891e-02  23.570  < 2e-16 ***
## labelappeal2        2.629e+00  9.839e-02  26.725  < 2e-16 ***
## acidindex          -1.717e-01  1.096e-02 -15.667  < 2e-16 ***
## stars2              9.730e-01  3.092e-02  31.465  < 2e-16 ***
## stars3              1.475e+00  3.625e-02  40.685  < 2e-16 ***
## stars4              2.073e+00  5.660e-02  36.622  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.137 on 7999 degrees of freedom
##   (2881 observations deleted due to missingness)
## Multiple R-squared:  0.4642, Adjusted R-squared:  0.4631 
## F-statistic: 433.1 on 16 and 7999 DF,  p-value: < 2.2e-16
AIC 24820.04
Adj R2 0.46

Diagnostics


3.7 Lasso

We used zipath() to fit zero-inflated Poisson regression models with variable selection via lasso regularization. The count and the logit (zero-inflation) components both started with all predictor variables, and we kept the coefficient set that produced the smallest AIC for the model.

Zero-Inflated Model Coefficients
count zero
(Intercept) 1.0410 -5.4758
fixedacidity 0.0006 0.0000
volatileacidity -0.0119 0.1949
citricacid -0.0009 0.0000
residualsugar 0.0000 0.0000
chlorides -0.0177 0.0000
freesulfurdioxide 0.0000 -0.0002
totalsulfurdioxide 0.0000 -0.0007
density -0.3270 0.0000
ph 0.0059 0.0000
sulphates 0.0012 0.0580
alcohol 0.0060 0.0000
labelappeal-1 0.2845 -0.2619
labelappeal0 0.5471 0.0000
labelappeal1 0.7245 0.3875
labelappeal2 0.8797 0.3212
acidindex -0.0168 0.4609
stars2 0.1340 -2.4718
stars3 0.2321 -3.3585
stars4 0.3351 -2.6136

Theta Estimate:

The count-model coefficients that survive the regularization process include the dummy variables for labelappeal and stars, and the variable density.

  • labelappeal - the dummy variables derived from label appeal are strong indicators of the number of cases that will be purchased
  • stars - the dummy variables derived from wine ratings are strong indicators of number of cases that will be purchased
  • density - is a negative indicator of the number of cases purchased by distributors suggesting that lighter wines are more popular than full-bodied wines
  • the following variables effectively drop out of the final model: residualsugar, totalsulfurdioxide, freesulfurdioxide, and fixedacidity

The coefficients for the zero-inflation model that survived the regularization process include stars, labelappeal-1, labelappeal1, and volatileacidity.

  • stars - the dummy variables derived from wine ratings are strong indicators of number of cases that will be purchased
  • labelappeal-1 is negative and labelappeal1 is positive with no other label related dummy variable being included in the model. This would suggest that label aesthetics only count at the margins between positive and negative customer sentiment.
  • volatileacidity - its positive coefficient in the zero-inflation component implies that higher volatile acid content raises the probability of zero purchases; that is, lower volatile acidity is preferred when making a purchasing decision.

Diagnostics


Model Evaluation

Model mape smape mase mpe RMSE AIC Adjusted R2 F-statistic
Poisson Regression 1 NaN NaN 1.0339 NaN 2.6022 28733.84 NA NA
Poisson Regression 2 NaN NaN 0.4355 NaN 1.4496 27818.73 NA NA
Negative Binomial 1 NaN NaN 1.0339 NaN 2.6022 28735.98 NA NA
Negative Binomial 2 NaN NaN 0.4355 NaN 1.4496 27820.73 NA NA
Multiple Linear Model 1 Inf 32.9110 0.5009 -Inf 2.6022 24825.60 0.4630 364.6602
Multiple Linear Model 2 Inf 32.9116 0.5009 -Inf 1.1665 24820.04 0.4631 433.1453
Lasso Inf 32.5251 0.4906 -Inf 1.1563 27948.07 NA NA
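The NaN/Inf entries for MAPE and MPE above are expected: both divide by the actual value, and target includes zero-case wines. A minimal sketch of how these metrics behave on toy vectors, assuming the usual definitions:

```r
# Error metrics on toy data; a single zero actual makes MAPE infinite,
# while sMAPE, MASE and RMSE remain finite.
actual <- c(0, 2, 4, 5, 3)
pred   <- c(1, 2, 3, 6, 3)

rmse  <- sqrt(mean((actual - pred)^2))
mape  <- mean(abs((actual - pred) / actual)) * 100               # Inf (divide by 0)
smape <- mean(2 * abs(pred - actual) / (abs(actual) + abs(pred))) * 100
mase  <- mean(abs(actual - pred)) / mean(abs(diff(actual)))      # naive-scaled

c(rmse = rmse, mape = mape, smape = smape, mase = mase)
```

This is why we lean on MASE, sMAPE and RMSE when comparing the models in the table above.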

Predictions


Conclusion


Appendix

References

‘Total Sulfur Dioxide – Why it Matters, Too!’
Iowa State University
https://www.extension.iastate.edu/wine/total-sulfur-dioxide-why-it-matters-too/

R Code